EBA3500 Data analysis with programming
Jonas Moss
BI Norwegian Business School
Department of Data Science and Analytics
Note
Optimization terminated successfully.
Current function value: 0.574351
Iterations 5
To interpret the coefficients we must exponentiate them.
Hence increasing gpa by 1 increases the odds of being admitted by roughly \(60\%\), everything else held equal. Increasing the rank of the university by one decreases the odds by roughly \(30\%\), though.
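As a sanity check, here is the arithmetic (a minimal sketch; the coefficient values 0.47 and -0.357 are assumptions, chosen only to match the percentages above):

```python
import numpy as np

# exp(beta) is the multiplicative change in the odds per unit increase
# in the covariate. The coefficient values below are assumed examples.
beta_gpa = 0.47     # assumed; exp(0.47) is about 1.60
beta_rank = -0.357  # assumed; exp(-0.357) is about 0.70

odds_ratio_gpa = np.exp(beta_gpa)    # odds up about 60% per unit of gpa
odds_ratio_rank = np.exp(beta_rank)  # odds down about 30% per rank step
print(odds_ratio_gpa, odds_ratio_rank)
```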
\[f(\boldsymbol{x};\boldsymbol{\mu},\Sigma)=\frac{\exp\left(-\frac{1}{2}\left(\boldsymbol{x}-\boldsymbol{\mu}\right)^{T}\Sigma^{-1}\left(\boldsymbol{x}-\boldsymbol{\mu}\right)\right)}{\sqrt{(2\pi)^{k}|\Sigma|}}\]
* \(\boldsymbol{x}\in\mathbb{R}^k\) is the vector of observations.
* \(\boldsymbol{\mu}\in\mathbb{R}^k\) is the vector of means.
* \(\Sigma\) is the \(k\times k\) covariance matrix.
* \(|A|\) denotes the determinant of \(A\).
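A quick numerical check of the density formula, with illustrative values of \(\boldsymbol{\mu}\) and \(\Sigma\) (scipy's `multivariate_normal` is used for comparison):

```python
import numpy as np
from scipy.stats import multivariate_normal

# Illustrative parameter values (assumed, k = 2).
mu = np.array([0.0, 1.0])
sigma = np.array([[2.0, 0.5],
                  [0.5, 1.0]])
x = np.array([0.5, 0.5])

# Evaluate the density by hand, term by term from the formula above.
k = len(mu)
diff = x - mu
manual = np.exp(-0.5 * diff @ np.linalg.inv(sigma) @ diff) / \
    np.sqrt((2 * np.pi) ** k * np.linalg.det(sigma))

scipy_val = multivariate_normal(mu, sigma).pdf(x)
print(manual, scipy_val)  # the two values agree
```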
Theorem 1 (Multivariate central limit theorem) Let \(\boldsymbol{X}\in\mathbb{R}^k\) be a multivariate random variable with mean \(\boldsymbol{\mu}\) and covariance matrix \(\Sigma\). Then \[\sqrt{n}(\overline{\boldsymbol{X}}-\boldsymbol{\mu}) \to N(\boldsymbol{0}, \Sigma)\]
Theorem 2 (Asymptotic normality of maximum likelihood estimators) Let \(\hat{\theta}\) be a maximum likelihood estimator and \(\theta\) be the true parameter value. Then \(\hat{\theta}\) is asymptotically normal with limiting variance equal to the inverse Fisher information, i.e., \[\sqrt{n}(\hat{\theta}-\theta) \to N(0, I(\theta)^{-1})\]
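A small Monte Carlo sketch of Theorem 1 (simulated, deliberately non-normal data; the sample sizes are assumptions):

```python
import numpy as np

# Monte Carlo sketch of the multivariate CLT: sqrt(n) * (Xbar - mu)
# should have covariance close to Sigma, even for non-normal data.
rng = np.random.default_rng(0)
mu = np.array([1.0, -1.0])
sigma = np.array([[1.0, 0.3],
                  [0.3, 2.0]])
n, reps = 500, 2000

# Centered exponential noise (mean 0, identity covariance), colored with
# the Cholesky factor of Sigma so each observation has covariance Sigma.
noise = rng.exponential(1.0, size=(reps, n, 2)) - 1
x = noise @ np.linalg.cholesky(sigma).T + mu

scaled = np.sqrt(n) * (x.mean(axis=1) - mu)  # one draw per replication
print(np.cov(scaled.T))                      # close to sigma
```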
The estimated covariance matrix of the parameter estimates can be found by calling the fitted model's cov_params() method.
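For instance, with statsmodels on simulated data (the data-generating process below is an assumption for illustration), the reported standard errors are the square roots of the diagonal of cov_params():

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Simulate a simple logistic regression (assumed illustrative setup).
rng = np.random.default_rng(1)
x = rng.normal(size=500)
p = 1 / (1 + np.exp(-(-0.5 + x)))
df = pd.DataFrame({"x": x, "y": rng.binomial(1, p)})

fit = smf.logit("y ~ x", data=df).fit(disp=0)
cov = fit.cov_params()        # estimated covariance of the estimates
print(np.sqrt(np.diag(cov)))  # equals the reported standard errors fit.bse
```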
| Diaphragm use | Urinary tract infection: Yes | Urinary tract infection: No |
|---|---|---|
| Yes | 7 | 0 |
| No | 140 | 290 |
Model: \(P(\text{Urinary tract infection}) = F(\beta_{0}+\beta_{1}\,1[\text{Diaphragm use}]),\) where \(F\) is equal to, e.g., the logistic CDF.
Example of perfect separation
Definition 1 A binary regression model is quasi-perfectly separated if there is a nonzero \(\beta\) such that \(\beta^T X_0 \geq 0\) for every covariate vector \(X_0\) with \(y=0\) and \(\beta^T X_1 \leq 0\) for every covariate vector \(X_1\) with \(y=1\).
Theorem 3 The maximum likelihood estimators of a binary regression model are finite if and only if the regression model is not quasi-perfectly separated.
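A sketch of why Theorem 3 bites in the diaphragm table: every diaphragm user has an infection, so (holding the intercept fixed for simplicity) the log-likelihood keeps increasing in \(\beta_1\) and has no finite maximizer.

```python
import numpy as np

# Quasi-perfect separation in the diaphragm table: 7 infections and
# 0 non-infections among users, 140 vs. 290 among non-users.
sigmoid = lambda z: 1 / (1 + np.exp(-z))

def loglik(b0, b1):
    # Grouped binomial log-likelihood; the "Yes" row contributes only
    # 7 * log(p_yes) because it has zero non-infections.
    p_no = sigmoid(b0)        # P(infection | no diaphragm)
    p_yes = sigmoid(b0 + b1)  # P(infection | diaphragm)
    return 140 * np.log(p_no) + 290 * np.log(1 - p_no) + 7 * np.log(p_yes)

b0 = np.log(140 / 290)  # intercept implied by the "No" row alone
lls = [loglik(b0, b1) for b1 in (1, 5, 10, 20)]
print(lls)  # strictly increasing: no finite MLE for beta_1
```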
| | caffeine | n | a | prob |
|---|---|---|---|---|
| 0 | 0.0 | 30.0 | 10.0 | 0.33 |
| 1 | 50.0 | 30.0 | 13.0 | 0.43 |
| 2 | 100.0 | 30.0 | 17.0 | 0.57 |
| 3 | 150.0 | 30.0 | 15.0 | 0.50 |
| 4 | 200.0 | 30.0 | 10.0 | 0.33 |
A researcher wishes to know if caffeine improves performance on a memory test. Volunteers consume different amounts of caffeine from 0 to 500 mg, and their score on the memory test is recorded. Source.
| | caffeine | n | a | prob |
|---|---|---|---|---|
| 0 | 0.0 | 30.0 | 10.0 | 0.33 |
| 1 | 50.0 | 30.0 | 13.0 | 0.43 |
| 2 | 100.0 | 30.0 | 17.0 | 0.57 |
| 3 | 150.0 | 30.0 | 15.0 | 0.50 |
| 4 | 200.0 | 30.0 | 10.0 | 0.33 |
with n giving us the number of observations with the given covariate levels and a the number who got an A!
import matplotlib.pyplot as plt
import numpy as np

logit = lambda x: np.log(x) - np.log(1 - x)  # log-odds (inverse of the logistic CDF)
plt.plot(data.caffeine, logit(model.predict()))
plt.scatter(data.caffeine, logit(data.prob))
model_log_probit = smf.glm(
"a + I(n - a) ~ caffeine + np.log(caffeine + 1)",
family=sm.families.Binomial(sm.genmod.families.links.probit()),
data=data).fit()
model_log_cauchit = smf.glm(
"a + I(n - a) ~ caffeine + np.log(caffeine + 1)",
family=sm.families.Binomial(sm.genmod.families.links.cauchy()),
data=data).fit()
model_log_cloglog = smf.glm(
"a + I(n - a) ~ caffeine + np.log(caffeine + 1)",
family=sm.families.Binomial(sm.genmod.families.links.cloglog()),
data=data).fit()
{"logit": model_log.aic, "probit": model_log_probit.aic, "cauchit": model_log_cauchit.aic, "cloglog": model_log_cloglog.aic}
{'logit': 45.844885174715465,
 'probit': 45.37382082095568,
 'cauchit': 52.37445834568989,
 'cloglog': 46.901711989146094}
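Picking the link with the smallest AIC from the values above:

```python
# Smaller AIC is better; the values are taken from the output above.
aics = {"logit": 45.844885174715465, "probit": 45.37382082095568,
        "cauchit": 52.37445834568989, "cloglog": 46.901711989146094}
best = min(aics, key=aics.get)
print(best)  # probit has the smallest AIC here
```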
The predict method gives you a probability, and \(\hat{Y}\) is the predicted value of \(Y\).
admission = pd.read_csv("https://stats.idre.ucla.edu/stat/data/binary.csv")
admission.head()
model = smf.logit("admit ~ gre + gpa + C(rank)", data = admission).fit()
plt.hist(model.predict())
Optimization terminated successfully.
Current function value: 0.573147
Iterations 6
The confusion matrix shows true positives, false negatives, false positives, and true negatives.
| | \(\hat{Y}=1\) | \(\hat{Y}=0\) |
|---|---|---|
| \(Y=1\) | True positive | False negative |
| \(Y=0\) | False positive | True negative |
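A sketch of computing such a matrix with pandas (simulated probabilities and outcomes, an assumption for illustration): classify \(\hat{Y}=1\) when the predicted probability exceeds a cutoff \(c\).

```python
import numpy as np
import pandas as pd

# Simulated predicted probabilities and outcomes consistent with them.
rng = np.random.default_rng(2)
p_hat = rng.uniform(size=200)
y = rng.binomial(1, p_hat)

c = 0.5
y_hat = (p_hat > c).astype(int)  # classify as 1 above the cutoff

confusion = pd.crosstab(y, y_hat, rownames=["Y"], colnames=["Y-hat"])
print(confusion)
```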
admission with \(c = 0.5\)

| | \(\hat{Y}=1\) | \(\hat{Y}=0\) |
|---|---|---|
| \(Y=1\) | 30 | 97 |
| \(Y=0\) | 19 | 254 |
admission with \(c = 0.6\)

| | \(\hat{Y}=1\) | \(\hat{Y}=0\) |
|---|---|---|
| \(Y=1\) | 13 | 144 |
| \(Y=0\) | 19 | 266 |
| | \(\hat{Y}=1\) | \(\hat{Y}=0\) |
|---|---|---|
| \(Y=1\) | 30 | 97 |
| \(Y=0\) | 19 | 254 |
\[
\text{true positive rate}=\frac{\text{true positives}}{\text{true positives} + \text{false negatives}}
\] Also known as sensitivity, recall, and hit rate. Abbreviated as tpr.
\[ \text{false positive rate}=\frac{\text{false positives}}{\text{false positives} + \text{true negatives}} \]
Also known as fall-out. Abbreviated as fpr.
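Plugging in the admission confusion matrix with \(c = 0.5\):

```python
# Counts from the confusion matrix above (c = 0.5).
tp, fn = 30, 97   # Y = 1 row
fp, tn = 19, 254  # Y = 0 row

tpr = tp / (tp + fn)  # sensitivity / recall
fpr = fp / (fp + tn)  # fall-out
print(round(tpr, 3), round(fpr, 3))
```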
| | \(\hat{Y}=1\) | \(\hat{Y}=0\) |
|---|---|---|
| \(Y=1\) | 30 | 97 |
| \(Y=0\) | 19 | 254 |
The receiver operating characteristic curve. Source: Wikipedia.
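A minimal sketch of tracing out an ROC curve by sweeping the cutoff (simulated scores are an assumption; in practice use model.predict() and the observed outcomes):

```python
import numpy as np

# Simulated outcomes and scores where higher scores favor Y = 1.
rng = np.random.default_rng(3)
y = rng.binomial(1, 0.4, size=500)
scores = np.clip(0.4 * y + rng.normal(0.3, 0.2, size=500), 0, 1)

# Sweep the cutoff c; each c yields one (fpr, tpr) point on the curve.
cutoffs = np.linspace(0, 1, 101)
tpr = [np.mean(scores[y == 1] >= c) for c in cutoffs]
fpr = [np.mean(scores[y == 0] >= c) for c in cutoffs]

# Plot fpr against tpr to draw the ROC curve, e.g. plt.plot(fpr, tpr).
print(tpr[0], fpr[0])  # at c = 0 everything is classified as positive
```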
Remember the setup for the \(R^{2}\) for non-linear regression: \[ 1-\frac{\sum_{i=1}^{n}(y_{i}-f(x_{i};a_{1},\ldots,a_{p}))^{2}}{\sum_{i=1}^{n}(y_{i}-m)^{2}}\] In general we could write \[R^{2}=1-\frac{R(x)}{R(m)}\] where \(R(x)\) is the loss of the fitted model and \(R(m)\) the loss of the null model that always predicts the mean \(m\).
We use the log-likelihood instead of the quadratic loss! So, letting \(\hat{p}_i = f(x_{i};a_{1},\ldots,a_{p})\), define \[l(x) = R(x) = \sum_{i=1}^{n}\left(-y_{i}\log \hat{p}_{i}-(1-y_{i})\log(1-\hat{p}_{i})\right),\] the negative log-likelihood of the fitted model.
\[l(m) = R(m) = \sum_{i=1}^{n}\left(-y_{i}\log m-(1-y_{i})\log(1-m)\right),\] is the corresponding quantity for the null model, which always predicts \(m=\overline{y}\). It measures how well we can predict \(y\) when we don’t know any \(x\) at all.
Now define the pseudo-\(R^2\) or McFadden \(R^2\) as \[R^2_{\textrm{McFadden}} = 1 - \frac{l(x)}{l(m)}.\]
probit_fit = smf.probit("admit ~ gre + C(rank) + gpa", data = admission).fit()
logit_fit = smf.logit("admit ~ gre + C(rank) + gpa", data = admission).fit()
def rsq_mcfadden(fit):
lower = fit.llnull
upper = fit.llf
return 1 - upper / lower
print(rsq_mcfadden(probit_fit), rsq_mcfadden(logit_fit))
Optimization terminated successfully.
Current function value: 0.573016
Iterations 5
Optimization terminated successfully.
Current function value: 0.573147
Iterations 6
0.08313059666890721 0.08292194470084713
Thus the \(R^2\)s are \(0.0831\) and \(0.0829\). But we may also access McFadden’s \(R^2\) directly: print(probit_fit.prsquared, logit_fit.prsquared).
Question: Can you compare these \(R^2\)s to the \(R^2\)s from linear regression?
The following picture, Figure 5 from Chapter 5 of McFadden’s Urban Travel Demand: A Behavioral Analysis (1996), illustrates the typical relationship between the least squares \(R^2\) (on the \(y\)-axis) and the pseudo-\(R^2\) on the \(x\)-axis.
McFadden’s comparison of the ordinary \(R^2\) and McFadden’s \(R^2\). Source: McFadden, D. Urban Travel Demand: A Behavioral Analysis (1996)